Termhood-Based Comparability Metrics of Comparable Corpus in Special Domain

نویسندگان

  • Sa Liu
  • Chengzhi Zhang
چکیده

Cross-Language Information Retrieval (CLIR) and machine translation (MT) resources, such as dictionaries and parallel corpora, are scarce and hard to come by for special domains. Besides, these resources are just limited to a few languages, such as English, French, and Spanish and so on. So, obtaining comparable corpora automatically for such domains could be an answer to this problem effectively. Comparable corpora, that the subcorpora are not translations of each other, can be easily obtained from web. Therefore, building and using comparable corpora is often a more feasible option in multilingual information processing. Comparability metrics is one of key issues in the field of building and using comparable corpus. Currently, there is no widely accepted definition or metrics method of corpus comparability. In fact, Different definitions or metrics methods of comparability might be given to suit various tasks about natural language processing. A new comparability, namely, termhood-based metrics, oriented to the task of bilingual terminology extraction, is proposed in this paper. In this method, words are ranked by termhood not frequency, and then the cosine similarities, calculated based on the ranking lists of word termhood, is used as comparability. Experiments results show that termhood-based metrics performs better than traditional frequency-based metrics.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Survey on Comparable Corpora until June 2012

Here we present a survey of important work done on Comparable Corpora between the period 1995 to 2012. Unlike parallel corpora, which are clearly defined as translated texts, there is a wide variation of non-parallelism in comparable text. Non-parallelism is manifested in terms of differences in author, domain, topics, time period, language. The most common text corpora have non-parallelism in ...

متن کامل

An Improved Corpus Comparison Approach to Domain Specific Term Recognition

Domain specific terms are words carrying special conceptual meanings in a subject field. Automatic term recognition plays an important role in many natural language processing and knowledge engineering applications such as information retrieval and knowledge mining. This paper explores a novel approach to automatic term extraction based on the basic ideas of corpus comparison and emerging patte...

متن کامل

A Study on Terminology Extraction Based on Classified Corpora

Algorithms for automatic term extraction in a specific domain should consider at least two issues, namely Unithood and Termhood(Kageura,1996). Unithood refers to the degree of a string to occur as a word or a phrase. Termhood (Chen Yirong, 2005) refers to the degree of a word or a phrase to occur as a domain specific concept. Unlike unithood, study on termhood is not yet widely reported. In cla...

متن کامل

WI&CRF: روش پیشنهادی برای استخراج اطلاعات مورد نیاز از متون نظامی

Military Information Extraction techniques are interested for military managers and commanders. But usual information extraction techniques cannot be used for that domain, because military corpus has special structure that differs from non-military corpus. In this paper the military documents structure is compared with non-military documents structure. Moreover a new classification is proposed ...

متن کامل

Bilingual Terminology Extraction Using Multi-level Termhood

Purpose: Terminology is the set of technical words or expressions used in specific contexts, which denotes the core concept in a formal discipline and is usually applied in the fields of machine translation, information retrieval, information extraction and text categorization, etc. Bilingual terminology extraction plays an important role in the application of bilingual dictionary compilation, ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012